Single-Sample cfDNA WGBS Analysis Report

cfDNA WGBS Analysis and Prediction

Author

Qi Yan

Published

January 30, 2026

1 Introduction

This report presents an representative analysis of a single cell-free DNA (cfDNA) sample from whole-genome bisulfite sequencing (WGBS) data. The analysis follows the same algorithms and methods used in the ALS vs Control study to ensure reproducibility and comparability.

1.1 Analysis Overview

The pipeline generates the following outputs:

  1. Quality Control Metrics: Mapping statistics, bisulfite conversion efficiency, and coverage metrics
  2. Fragment Size Analysis: Insert size distributions showing the characteristic cfDNA nucleosome pattern
  3. Fragmentation Ratio Track: Genome-wide short/long fragment ratio visualization
  4. Genomic Distribution: Distribution of fragment 5’ start sites across genomic features
  5. TSS Enrichment: Nucleosome positioning signal around transcription start sites
  6. End Motif Analysis: 5’ end 4-mer motif frequencies reflecting nuclease preferences
  7. Classification Prediction: Predicted group (ALS vs Ctrl) using pre-trained Random Forest model

1.2 Input Files

This notebook requires two input files:

  • Bismark-aligned BAM file: Deduplicated BAM with XM methylation tags
  • Bismark coverage file: Output from bismark_methylation_extractor (.bismark.cov.gz)

If you only have a Bismark-aligned BAM file, run the following command to extract methylation calls:

bismark_methylation_extractor --gzip --bedGraph \
  --cytosine_report --genome_folder /path/to/genome \
  your_sample.bam

For detailed documentation, see: Bismark Methylation Extraction

2 Dependencies

This section lists all software dependencies required to run this analysis.

2.1 R Packages

2.1.1 CRAN Packages

Required CRAN packages
Package Version Purpose
ggplot2 >=3.4.0 Data visualization
dplyr >=1.1.0 Data manipulation
tidyr >=1.3.0 Data tidying
readr >=2.1.0 File I/O
purrr >=1.0.0 Functional programming
stringr >=1.5.0 String manipulation
tibble >=3.2.0 Modern data frames
patchwork >=1.1.0 Plot composition
here >=1.0.0 Project-relative paths
scales >=1.2.0 Axis formatting
randomForest >=4.7 Classification model
knitr >=1.40 Report generation
kableExtra >=1.3.0 Table formatting

2.1.2 Bioconductor Packages

Required Bioconductor packages
Package Version Purpose
Rsamtools >=2.14.0 BAM file handling
cigarillo >=1.0.0 CIGAR string parsing
GenomicRanges >=1.50.0 Genomic intervals
IRanges >=2.32.0 Integer ranges
S4Vectors >=0.36.0 S4 vector infrastructure
Biostrings >=2.66.0 DNA sequence handling
BSgenome.Hsapiens.UCSC.hg38 >=1.4.0 Human reference genome (hg38)
GenomeInfoDb >=1.34.0 Genome metadata
TxDb.Hsapiens.UCSC.hg38.knownGene >=3.16.0 Gene annotations (hg38)
GenomicFeatures >=1.50.0 Genomic features
ChIPseeker >=1.34.0 Peak annotation
org.Hs.eg.db >=3.16.0 Human gene IDs
AnnotationDbi >=1.60.0 Annotation infrastructure
bsseq >=1.34.0 Bisulfite-seq analysis
Gviz >=1.42.0 Genomic track visualization

2.2 External Tools

Required external tools
Tool Version Purpose Link
Bismark >=0.24.0 Bisulfite read alignment & methylation extraction [GitHub](ht
samtools >=1.17 BAM file manipulation (indexing) [GitHub](ht
Quarto >=1.3.0 Document rendering [quarto.org
# CRAN packages
install.packages(c("ggplot2", "dplyr", "tidyr", "readr", "purrr", 
                   "stringr", "tibble", "patchwork", "here", "scales",
                   "randomForest", "knitr", "kableExtra"))

# Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c(
  "Rsamtools", "cigarillo", "GenomicRanges", "IRanges", "S4Vectors",
  "Biostrings", "BSgenome.Hsapiens.UCSC.hg38", "GenomeInfoDb",
  "TxDb.Hsapiens.UCSC.hg38.knownGene", "GenomicFeatures",
  "ChIPseeker", "org.Hs.eg.db", "AnnotationDbi",
  "bsseq", "Gviz"
))

3 Setup and Input Validation

Show code
# Load required packages
suppressPackageStartupMessages({
  # CRAN packages
  library(ggplot2)
  library(dplyr)
  library(tidyr)
  library(readr)
  library(purrr)
  library(stringr)
  library(tibble)
  library(patchwork)
  library(here)
  library(knitr)
  library(kableExtra)
  library(scales)
  library(randomForest)
  
  # Bioconductor packages
  library(Rsamtools)
  library(cigarillo)
  library(GenomicRanges)
  library(IRanges)
  library(S4Vectors)
  library(Biostrings)
  library(BSgenome.Hsapiens.UCSC.hg38)
  library(GenomeInfoDb)
  library(TxDb.Hsapiens.UCSC.hg38.knownGene)
  library(GenomicFeatures)
  library(ChIPseeker)
  library(org.Hs.eg.db)
  library(AnnotationDbi)
  library(bsseq)
  library(Gviz)
})

# Load analysis functions (from same directory as this notebook)
# The functions file should be in the same directory as this .qmd file
source("cfDNA_analysis_functions.R")

# Set seed for reproducibility
set.seed(389)
Show code
# Parameters from YAML
BAM_FILE <- params$bam_file
BISMARK_COV_FILE <- params$bismark_cov_file
SAMPLE_ID <- params$sample_id
OUTPUT_DIR <- params$output_dir
PROJECT_DIR <- params$project_dir

# Create output directory
if (!dir.exists(OUTPUT_DIR)) {
  dir.create(OUTPUT_DIR, recursive = TRUE, showWarnings = FALSE)
}

# Temporary directory for intermediate files
TEMP_DIR <- file.path(OUTPUT_DIR, "temp")
if (!dir.exists(TEMP_DIR)) {
  dir.create(TEMP_DIR, recursive = TRUE, showWarnings = FALSE)
}

# Load reference data
genome <- BSgenome.Hsapiens.UCSC.hg38
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# Path to pre-trained model (from cohort analysis)
MODEL_PATH <- file.path(PROJECT_DIR, "data", "processed", "methylation", "rds", "nested_cv_results.rds")
STACKHMM_PATH <- file.path(PROJECT_DIR, "data", "processed", "hg38_genome_100_segments.bed")
Show code
# Validate input files
validation_results <- tibble::tibble(
  File = c("BAM file", "Bismark coverage file", "Pre-trained model", "stackHMM annotations"),
  Path = c(BAM_FILE, BISMARK_COV_FILE, MODEL_PATH, STACKHMM_PATH),
  Exists = c(file.exists(BAM_FILE), file.exists(BISMARK_COV_FILE), 
             file.exists(MODEL_PATH), file.exists(STACKHMM_PATH))
)

knitr::kable(validation_results, caption = "Input file validation") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  kableExtra::row_spec(which(!validation_results$Exists), bold = TRUE, color = "white", background = "#E64B35")
Input file validation
File Path Exists
BAM file /Users/qyan/Library/CloudStorage/GoogleDrive-qyan@altoslabs.com/My Drive/Documents/Job/PrimaMente_Bioinfo/Interview/Take_home_task/cfDNA_WGBS_ALS_GSE164600/data/raw/SRR13404367.deduplicated.sorted_ds10mill_chr21.bam TRUE
Bismark coverage file /Users/qyan/Library/CloudStorage/GoogleDrive-qyan@altoslabs.com/My Drive/Documents/Job/PrimaMente_Bioinfo/Interview/Take_home_task/cfDNA_WGBS_ALS_GSE164600/data/processed/methylation_extractor/SRR13404367.deduplicated.sorted_ds10mill_chr21.name_sorted/SRR13404367.deduplicated.sorted_ds10mill_chr21.name_sorted.bismark.cov.gz TRUE
Pre-trained model ../data/processed/methylation/rds/nested_cv_results.rds TRUE
stackHMM annotations ../data/processed/hg38_genome_100_segments.bed TRUE
Show code
# Stop if required files are missing
required_files <- c(BAM_FILE, BISMARK_COV_FILE)
missing_required <- required_files[!file.exists(required_files)]
if (length(missing_required) > 0) {
  stop("Missing required input files:\n", paste("  -", missing_required, collapse = "\n"))
}

4 QC Metrics

This section extracts quality control metrics from the BAM file, including mapping statistics, fragment size distribution, and methylation metrics.

Show code
# Extract BAM metrics (this generates the fragment BED file as well)
message("Extracting BAM metrics...")
metrics <- extract_bam_metrics(
  bam_path = BAM_FILE,
  sample_id = SAMPLE_ID,
  genome = genome,
  chunk_size = 1e6,
  frag_dir = TEMP_DIR
)

# Calculate summary statistics
summary_stats <- calculate_summary_stats(metrics)
Show code
# Format key metrics for display
qc_display <- tibble::tibble(
  Metric = c(
    "Total Reads",
    "Mapped Reads",
    "Mapping Rate",
    "Properly Paired",
    "Proper Pair Rate",
    "Duplicates",
    "Duplicate Rate",
    "Filtered Reads (MAPQ≥30)",
    "Filter Pass Rate",
    "Mean MAPQ",
    "Number of Fragments",
    "Mean Fragment Length",
    "Median Fragment Length",
    "Mode Fragment Length",
    "GC Content",
    "CpG Methylation",
    "Bisulfite Conversion"
  ),
  Value = c(
    format(summary_stats$total_reads, big.mark = ","),
    format(summary_stats$mapped_reads, big.mark = ","),
    sprintf("%.2f%%", summary_stats$mapping_rate * 100),
    format(summary_stats$properly_paired, big.mark = ","),
    sprintf("%.2f%%", summary_stats$proper_pair_rate * 100),
    format(summary_stats$duplicates, big.mark = ","),
    sprintf("%.2f%%", summary_stats$duplicate_rate * 100),
    format(summary_stats$filtered_reads, big.mark = ","),
    sprintf("%.2f%%", summary_stats$filter_pass_rate * 100),
    sprintf("%.1f", summary_stats$mean_mapq),
    format(summary_stats$n_fragments, big.mark = ","),
    sprintf("%.1f bp", summary_stats$mean_frag_length),
    sprintf("%.0f bp", summary_stats$median_frag_length),
    sprintf("%.0f bp", summary_stats$frag_mode),
    sprintf("%.2f%%", summary_stats$gc_content * 100),
    sprintf("%.2f%%", summary_stats$cpg_methylation * 100),
    sprintf("%.2f%%", summary_stats$bisulfite_conversion * 100)
  )
)

knitr::kable(qc_display, caption = paste("QC Summary for", SAMPLE_ID)) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  kableExtra::pack_rows("Alignment Statistics", 1, 9) %>%
  kableExtra::pack_rows("Fragment Statistics", 10, 14) %>%
  kableExtra::pack_rows("Methylation Statistics", 15, 17)
QC Summary for SRR13404367
Metric Value
Alignment Statistics
Total Reads 128,966
Mapped Reads 128,966
Mapping Rate 100.00%
Properly Paired 128,966
Proper Pair Rate 100.00%
Duplicates 0
Duplicate Rate 0.00%
Filtered Reads (MAPQ≥30) 111,610
Filter Pass Rate 86.54%
Fragment Statistics
Mean MAPQ 38.1
Number of Fragments 55,805
Mean Fragment Length 174.4 bp
Median Fragment Length 162 bp
Mode Fragment Length 160 bp
Methylation Statistics
GC Content 38.49%
CpG Methylation 82.71%
Bisulfite Conversion 99.53%

4.1 QC Metrics Visualization

Show code
# Prepare data for bar chart
qc_plot_data <- tibble::tibble(
  metric = c("Total Reads (M)", "Filtered Reads (M)", "Mean MAPQ", 
             "Median Fragment (bp)", "GC Content (%)", 
             "CpG Methylation (%)", "Bisulfite Conv. (%)"),
  value = c(
    summary_stats$total_reads / 1e6,
    summary_stats$filtered_reads / 1e6,
    summary_stats$mean_mapq,
    summary_stats$median_frag_length,
    summary_stats$gc_content * 100,
    summary_stats$cpg_methylation * 100,
    summary_stats$bisulfite_conversion * 100
  ),
  category = c("Reads", "Reads", "Quality", "Fragment", "Fragment", "Methylation", "Methylation")
) %>%
  dplyr::mutate(
    metric = factor(metric, levels = rev(metric)),
    label = dplyr::case_when(
      grepl("Reads", metric) ~ sprintf("%.1fM", value),
      grepl("MAPQ", metric) ~ sprintf("%.1f", value),
      grepl("Fragment", metric) & !grepl("%", metric) ~ sprintf("%.0f bp", value),
      TRUE ~ sprintf("%.1f%%", value)
    )
  )

# Create horizontal bar chart
p_qc <- ggplot(qc_plot_data, aes(x = value, y = metric, fill = category)) +
  geom_col(width = 0.7, alpha = 0.9) +
  geom_text(aes(label = label), hjust = -0.1, size = 3.5, fontface = "bold") +
  scale_fill_brewer(palette = "Set2") +
  scale_x_continuous(expand = expansion(mult = c(0, 0.25))) +
  labs(
    title = paste("QC Metrics:", SAMPLE_ID),
    subtitle = "Key quality indicators from BAM file analysis",
    x = "Value",
    y = NULL,
    fill = "Category"
  ) +
  theme_pub(base_size = 12) +
  theme(
    legend.position = "top",
    panel.grid.major.y = element_blank()
  )

p_qc
Figure 1: Key QC metrics for the sample

5 Fragmentation Analysis

Cell-free DNA exhibits characteristic insert size patterns reflecting nucleosome protection. This section analyzes the insert size distribution and genome-wide short/long fragment ratio.

5.1 Insert Size Distribution

Show code
# Prepare fragment length data
frag_df <- tibble::tibble(
  fragment_length = metrics$fragment_lengths
) %>%
  dplyr::filter(fragment_length >= 80, fragment_length <= 450)

# Count fragments by length
counts_saw <- frag_df %>%
  dplyr::count(fragment_length, name = "n")

# Calculate mode for annotation
calc_mode <- function(x) {
  x <- x[is.finite(x)]
  if (length(x) == 0) return(NA_integer_)
  x <- as.integer(x)
  tabulate(x, nbins = max(x)) |> which.max()
}

mode_bp <- calc_mode(metrics$fragment_lengths)
median_bp <- median(metrics$fragment_lengths, na.rm = TRUE)
short_n <- sum(metrics$fragment_lengths >= 100 & metrics$fragment_lengths <= 150)
long_n <- sum(metrics$fragment_lengths >= 151 & metrics$fragment_lengths <= 220)
short_long_ratio <- if (long_n > 0) short_n / long_n else NA

# Create sawtooth plot
p_sawtooth <- ggplot(counts_saw, aes(x = fragment_length, y = n)) +
  geom_line(linewidth = 0.5, color = "gray20") +
  geom_vline(xintercept = 167, linetype = "dashed", color = "#E64B35", linewidth = 0.6) +
  geom_vline(xintercept = 334, linetype = "dashed", color = "#4DBBD5", linewidth = 0.6) +
  annotate("text", x = 175, y = max(counts_saw$n) * 0.95, 
           label = "Mono-nucleosome\n(~167 bp)", hjust = 0, size = 3, color = "#E64B35") +
  annotate("text", x = 342, y = max(counts_saw$n) * 0.5, 
           label = "Di-nucleosome\n(~334 bp)", hjust = 0, size = 3, color = "#4DBBD5") +
  annotate("label", x = 380, y = max(counts_saw$n) * 0.85,
           label = sprintf("Median: %d bp\nMode: %d bp\nS/L ratio: %.3f", 
                          as.integer(median_bp), mode_bp, short_long_ratio),
           hjust = 0, size = 3, fill = "white", label.size = 0.3) +
  labs(
    title = "cfDNA Fragment Size Distribution",
    subtitle = paste("Sample:", SAMPLE_ID),
    x = "Fragment Length (bp)",
    y = "Count"
  ) +
  theme_pub(base_size = 12)

p_sawtooth
Figure 2: Fragment size distribution showing characteristic cfDNA nucleosome pattern

5.2 Fragmentation Ratio Track

The short/long fragment ratio varies across the genome and can reveal nucleosome positioning differences.

Show code
# Set up genomic bins (chr21 only for efficiency)
bin_size <- 2e6
chroms <- "chr21"
seqlens <- GenomeInfoDb::seqlengths(genome)[chroms]
seqlens <- seqlens[!is.na(seqlens)]

bins_gr <- GenomicRanges::tileGenome(
  seqlengths = seqlens,
  tilewidth = bin_size,
  cut.last.tile.in.chrom = TRUE
)

bins_df <- tibble::tibble(
  bin_id = seq_along(bins_gr),
  chrom = as.character(GenomeInfoDb::seqnames(bins_gr)),
  start = start(bins_gr),
  end = end(bins_gr),
  mid = as.integer(floor((start + end) / 2))
)

# GC content per bin
bin_seqs <- Biostrings::getSeq(genome, bins_gr)
freq <- Biostrings::letterFrequency(bin_seqs, letters = c("A", "C", "G", "T", "N"), as.prob = FALSE)
acgt <- rowSums(freq[, c("A", "C", "G", "T"), drop = FALSE])
gc <- (freq[, "G"] + freq[, "C"]) / pmax(acgt, 1)
n_frac <- freq[, "N"] / Biostrings::width(bin_seqs)

bins_df <- bins_df %>%
  dplyr::mutate(gc = as.numeric(gc), n_frac = as.numeric(n_frac))

# GC filtering thresholds
gc_min <- 0.10
gc_max <- 0.85
n_frac_max <- 0.45
bins_keep_for_gc <- is.finite(bins_df$gc) &
  bins_df$gc >= gc_min & bins_df$gc <= gc_max &
  is.finite(bins_df$n_frac) & bins_df$n_frac <= n_frac_max

# Compute bin counts
fragment_bed_path <- file.path(TEMP_DIR, paste0(SAMPLE_ID, ".fragments.bed.gz"))
bin_tracks <- bin_counts_one_sample(
  path = fragment_bed_path,
  bins_gr = bins_gr,
  bins_df = bins_df,
  bins_keep_for_gc = bins_keep_for_gc
)

bin_tracks <- bin_tracks %>%
  dplyr::left_join(bins_df, by = "bin_id")
Show code
# Plot fragmentation ratio track
p_frag_track <- ggplot(bin_tracks, aes(x = mid / 1e6, y = short_long_ratio)) +
  geom_hline(yintercept = 1, color = "gray70", linewidth = 0.35) +
  geom_line(linewidth = 0.8, color = "#2E4A62") +
  geom_point(size = 1.5, color = "#2E4A62", alpha = 0.7) +
  scale_x_continuous(breaks = seq(0, 50, by = 10)) +
  labs(
    title = paste("Fragmentation Ratio Track:", SAMPLE_ID),
    subtitle = sprintf("GC-corrected short(90-150bp)/long(151-220bp) ratio in %d Mb bins on chr21", 
                      bin_size / 1e6),
    x = "Genomic Position (Mb)",
    y = "Short/Long Ratio"
  ) +
  theme_pub(base_size = 12)

p_frag_track
Figure 3: Genome-wide fragmentation ratio track (GC-corrected short/long ratio)
Show code
# Gviz visualization
if (requireNamespace("Gviz", quietly = TRUE)) {
  genome_build <- "hg38"
  chrom <- "chr21"
  from_bp <- min(bins_df$start, na.rm = TRUE)
  to_bp <- max(bins_df$end, na.rm = TRUE)
  
  gr_bins <- GenomicRanges::GRanges(
    seqnames = bins_df$chrom,
    ranges = IRanges::IRanges(start = bins_df$start, end = bins_df$end)
  )
  
  y <- bin_tracks$short_long_ratio
  y_lim <- range(y[is.finite(y)], na.rm = TRUE)
  y_pad <- 0.15 * diff(y_lim)
  y_lim <- c(y_lim[1] - y_pad, y_lim[2] + y_pad)
  
  axis_track <- Gviz::GenomeAxisTrack(genome = genome_build, chromosome = chrom, name = "")
  
  ratio_track <- Gviz::DataTrack(
    range = gr_bins,
    data = y,
    genome = genome_build,
    chromosome = chrom,
    name = "S/L ratio",
    type = "l",
    col = "#2E4A62",
    lwd = 2,
    ylim = y_lim,
    baseline = 1,
    col.baseline = "gray45",
    lty.baseline = 2,
    lwd.baseline = 1,
    cex.axis = 0.9,
    cex.title = 0.9
  )
  
  ideo_track <- tryCatch(
    Gviz::IdeogramTrack(genome = genome_build, chromosome = chrom),
    error = function(e) NULL
  )
  
  tracks <- list(axis_track, ratio_track)
  if (!is.null(ideo_track)) tracks <- c(list(ideo_track), tracks)
  
  Gviz::plotTracks(
    tracks,
    from = from_bp,
    to = to_bp,
    background.title = "white",
    col.title = "black",
    cex.title = 0.95,
    background.panel = "white",
    col.axis = "black",
    col.frame = "gray85"
  )
}
Figure 4: Chromosome-style fragmentation ratio visualization

6 Fragment Start Site Analysis

This section analyzes the genomic distribution of fragment 5’ start sites and their enrichment around transcription start sites.

6.1 Genomic Distribution

Show code
# Annotate fragment 5' starts
start_annot <- annotate_fragment_starts_one_sample(
  fragment_bed_gz = fragment_bed_path,
  sample_id = SAMPLE_ID,
  txdb = txdb
)
>> preparing features information...         2026-01-30 15:39:18 
>> Using Genome: hg38 ...
>> identifying nearest features...       2026-01-30 15:39:20 
>> calculating distance from peak to TSS...  2026-01-30 15:39:21 
>> assigning genomic annotation...       2026-01-30 15:39:21 
>> Using Genome: hg38 ...
>> Using Genome: hg38 ...
>> adding flank feature information from peaks...    2026-01-30 15:40:02 
>> assigning chromosome lengths          2026-01-30 15:40:07 
>> done...                   2026-01-30 15:40:07 
Show code
# Prepare data for donut chart
annot_data <- start_annot %>%
  tidyr::pivot_longer(
    cols = c(Promoter, Exon, Intron, Three_UTR, Five_UTR, Distal_Intergenic, Downstream),
    names_to = "annotation",
    values_to = "n"
  ) %>%
  dplyr::mutate(
    pct = 100 * n / sum(n),
    annotation = dplyr::case_when(
      annotation == "Three_UTR" ~ "3' UTR",
      annotation == "Five_UTR" ~ "5' UTR",
      annotation == "Distal_Intergenic" ~ "Distal Intergenic",
      TRUE ~ annotation
    ),
    annotation = factor(annotation, levels = c("Promoter", "5' UTR", "Exon", "Intron", 
                                               "3' UTR", "Downstream", "Distal Intergenic"))
  ) %>%
  dplyr::arrange(annotation) %>%
  dplyr::mutate(
    ymax = cumsum(pct),
    ymin = dplyr::lag(ymax, default = 0),
    label_pos = (ymin + ymax) / 2,
    label = sprintf("%s\n%.1f%%", annotation, pct)
  )

# Create donut chart
p_donut <- ggplot(annot_data, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 2.5, fill = annotation)) +
  geom_rect(color = "white", linewidth = 0.5) +
  geom_text(aes(x = 3.25, y = label_pos, label = sprintf("%.1f%%", pct)), 
            size = 3, color = "white", fontface = "bold") +
  coord_polar(theta = "y") +
  xlim(c(1, 4)) +
  scale_fill_brewer(palette = "Set2") +
  annotate("text", x = 1, y = 0, label = paste("Total\n", format(start_annot$total_starts, big.mark = ",")),
           size = 4, fontface = "bold") +
  labs(
    title = "Genomic Distribution of Fragment 5' Starts",
    subtitle = paste("Sample:", SAMPLE_ID),
    fill = "Annotation"
  ) +
  theme_void(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "right"
  )

p_donut
Figure 5: Genomic distribution of fragment 5’ start sites

6.2 TSS Enrichment

Show code
# Get muscle-related genes on chr21
muscle_symbols <- c("COL6A1", "COL6A2")
muscle_map <- AnnotationDbi::select(
  org.Hs.eg.db::org.Hs.eg.db,
  keys = muscle_symbols,
  keytype = "SYMBOL",
  columns = c("SYMBOL", "ENTREZID")
) %>%
  dplyr::filter(!is.na(.data$ENTREZID)) %>%
  dplyr::mutate(ENTREZID = as.character(.data$ENTREZID))

genes_chr21 <- suppressMessages(GenomicFeatures::genes(txdb))
genes_chr21 <- genes_chr21[as.character(GenomeInfoDb::seqnames(genes_chr21)) == "chr21"]
genes_muscle <- genes_chr21[as.character(genes_chr21$gene_id) %in% unique(muscle_map$ENTREZID)]

# TSS windows for muscle genes
tss_points_m <- unique(GenomicRanges::promoters(genes_muscle, upstream = 0L, downstream = 1L))
tss_windows_m <- unique(GenomicRanges::promoters(genes_muscle, upstream = 2000L, downstream = 2000L))
tss_site_m <- start(tss_points_m)
tss_strand_m <- as.character(strand(tss_points_m))

# Compute TSS enrichment profile
tss_profile <- profile_tss_enrichment_starts(
  fragment_bed_gz = fragment_bed_path,
  tss_windows = tss_windows_m,
  tss_site = tss_site_m,
  tss_strand = tss_strand_m,
  flank = 2000L
)
Show code
# Smooth the profile
tss_profile_smooth <- tss_profile %>%
  dplyr::mutate(
    enrichment_smooth = moving_average(enrichment, k = 21),
    enrichment_smooth = dplyr::if_else(is.na(enrichment_smooth), enrichment, enrichment_smooth)
  )

p_tss <- ggplot(tss_profile_smooth, aes(x = rel_bp, y = enrichment_smooth)) +
  geom_hline(yintercept = 1, color = "gray60", linewidth = 0.35) +
  geom_vline(xintercept = 0, color = "gray30", linewidth = 0.5) +
  geom_line(linewidth = 0.9, color = "#2E4A62") +
  annotate("text", x = 50, y = max(tss_profile_smooth$enrichment_smooth, na.rm = TRUE) * 0.95,
           label = "TSS", hjust = 0, size = 4, fontface = "bold") +
  labs(
    title = "TSS Enrichment of Fragment 5' Start Sites",
    subtitle = paste("Muscle genes (COL6A1, COL6A2) on chr21 | Sample:", SAMPLE_ID),
    x = "Distance to TSS (bp)",
    y = "Normalized Fragment Density"
  ) +
  theme_pub(base_size = 12)

p_tss
Figure 6: TSS enrichment of fragment 5’ start sites around muscle genes on chr21

7 End Motif Analysis

Cell-free DNA fragments exhibit characteristic 5’ end motif patterns reflecting the sequence preferences of nucleases involved in cfDNA generation.

Show code
# Extract 4-mer end motifs
motifs_all <- all_4mers()
valid_seqnames <- seqnames(genome)

motif_counts <- extract_4mer_counts_from_frag_bed(
  frag_bed_gz = fragment_bed_path,
  sample_id = SAMPLE_ID,
  genome = genome,
  valid_seqnames = valid_seqnames,
  motifs_all = motifs_all
)

# Calculate frequencies
motif_freq <- motif_counts %>%
  dplyr::mutate(
    total = sum(count),
    frequency = count / total
  )

7.1 Top 20 End Motifs

Show code
# Get top 20 motifs
top20_motifs <- motif_freq %>%
  dplyr::arrange(desc(frequency)) %>%
  dplyr::slice_head(n = 20) %>%
  dplyr::mutate(motif = factor(motif, levels = rev(motif)))

p_top20 <- ggplot(top20_motifs, aes(x = frequency, y = motif)) +
  geom_col(fill = "#3C5488", alpha = 0.9, width = 0.7) +
  geom_text(aes(label = sprintf("%.4f", frequency)), hjust = -0.1, size = 3) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Top 20 cfDNA 5' End Motifs (4-mer)",
    subtitle = paste("Sample:", SAMPLE_ID),
    x = "Frequency",
    y = "Motif"
  ) +
  theme_pub(base_size = 12) +
  theme(panel.grid.major.y = element_blank())

p_top20
Figure 7: Top 20 cfDNA 5’ end motifs (4-mer)

7.2 Motif Diversity Score

The Motif Diversity Score (MDS) is calculated as the normalized Shannon entropy of the 256 4-mer frequencies.

Show code
# Calculate MDS
freq_vec <- motif_freq$frequency
entropy_bits <- shannon_entropy_bits(freq_vec)
max_entropy <- log2(256)  # Maximum possible entropy for 256 motifs
mds <- entropy_bits / max_entropy

mds_display <- tibble::tibble(
  Metric = c("Shannon Entropy (bits)", "Maximum Entropy (bits)", "Motif Diversity Score (MDS)"),
  Value = c(sprintf("%.4f", entropy_bits), sprintf("%.4f", max_entropy), sprintf("%.4f", mds))
)

knitr::kable(mds_display, caption = "End Motif Diversity Metrics") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
End Motif Diversity Metrics
Metric Value
Shannon Entropy (bits) 7.6745
Maximum Entropy (bits) 8.0000
Motif Diversity Score (MDS) 0.9593
Show code
# Create a simple gauge-like visualization for MDS
p_mds <- ggplot() +
  geom_rect(aes(xmin = 0, xmax = 1, ymin = 0, ymax = 1), fill = "gray90") +
  geom_rect(aes(xmin = 0, xmax = mds, ymin = 0, ymax = 1), fill = "#3C5488", alpha = 0.8) +
  geom_text(aes(x = 0.5, y = 0.5, label = sprintf("MDS = %.4f", mds)), 
            size = 8, fontface = "bold", color = "white") +
  geom_text(aes(x = 0.5, y = -0.3, label = "Shannon entropy of 256 4-mer frequencies (normalized)"),
            size = 3.5, color = "gray40") +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  scale_y_continuous(limits = c(-0.5, 1.2)) +
  labs(title = paste("Motif Diversity Score:", SAMPLE_ID)) +
  theme_void(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14, margin = ggplot2::margin(b = 10)),
    axis.text.x = element_text(color = "gray30"),
    axis.ticks.x = element_line(color = "gray30")
  )

p_mds
Figure 8: Motif Diversity Score (MDS) - Normalized Shannon entropy of 256 4-mer frequencies

8 Classification Prediction

This section uses the pre-trained Random Forest model to predict whether the sample belongs to the ALS or Control group based on methylation features.

Show code
# Check if model exists
model_available <- file.exists(MODEL_PATH) && file.exists(STACKHMM_PATH)

if (model_available) {
  # Load pre-trained model
  all_results <- readRDS(MODEL_PATH)
  
  # Load stackHMM annotations
  stackhmm_df <- readr::read_tsv(
    STACKHMM_PATH,
    col_names = c("chr", "start", "end", "state"),
    col_types = "ciic",
    progress = FALSE
  ) %>%
    dplyr::filter(chr == "chr21")
  
  stackhmm_gr <- GenomicRanges::GRanges(
    seqnames = stackhmm_df$chr,
    ranges = IRanges::IRanges(start = stackhmm_df$start + 1L, end = stackhmm_df$end),
    state = stackhmm_df$state
  )
  states <- unique(stackhmm_df$state)
  
  # Load methylation data from bismark coverage file
  bs <- bsseq::read.bismark(
    files = BISMARK_COV_FILE,
    rmZeroCov = TRUE,
    strandCollapse = TRUE,
    verbose = FALSE
  )
  
  # Extract methylation values
  cpg_gr <- GenomicRanges::granges(bs)
  meth_mat <- bsseq::getMeth(bs, type = "raw", what = "perBase")
  
  # Calculate stackHMM methylation features
  stackhmm_meth <- numeric(length(states))
  names(stackhmm_meth) <- states
  
  for (i in seq_along(states)) {
    cpg_idx <- S4Vectors::queryHits(
      GenomicRanges::findOverlaps(cpg_gr, stackhmm_gr[stackhmm_gr$state == states[i]])
    )
    if (length(cpg_idx) > 0) {
      stackhmm_meth[states[i]] <- median(meth_mat[cpg_idx, 1], na.rm = TRUE)
    } else {
      stackhmm_meth[states[i]] <- NA
    }
  }
  
  # Prepare features for prediction (using stackHMM model)
  model_result <- all_results[["stackHMM"]]
  selected_features <- model_result$selected_features
  
  # Create feature matrix
  new_features <- matrix(stackhmm_meth[selected_features], nrow = 1,
                         dimnames = list(SAMPLE_ID, selected_features))
  
  # Handle missing features
  new_features[is.na(new_features)] <- 0
  
  # Make prediction
  prediction <- predict_single_sample(new_features, model_result)
  
  prediction_available <- TRUE
} else {
  prediction_available <- FALSE
  message("Pre-trained model not available. Skipping classification prediction.")
}
Show code
if (prediction_available) {
  # Display prediction results
  pred_class <- prediction$predicted_class
  pred_prob <- prediction$probabilities
  
  pred_color <- if (pred_class == "ALS") "#E64B35" else "#4DBBD5"
  
  cat("\n")
}
Show code
if (prediction_available) {
  # Create prediction visualization
  prob_data <- tibble::tibble(
    Group = c("Ctrl", "ALS"),
    Probability = c(pred_prob$Ctrl, pred_prob$ALS)
  ) %>%
    dplyr::mutate(Group = factor(Group, levels = c("Ctrl", "ALS")))
  
  # Create label for binary prediction
  confidence <- max(pred_prob$Ctrl, pred_prob$ALS) * 100
  prediction_label <- sprintf("Prediction: %s (%.1f%% confidence)", pred_class, confidence)
  
  p_pred <- ggplot(prob_data, aes(x = Group, y = Probability, fill = Group)) +
    geom_col(width = 0.6, alpha = 0.9) +
    geom_text(aes(label = sprintf("%.1f%%", Probability * 100)), 
              vjust = -0.5, size = 5, fontface = "bold") +
    geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray50") +
    annotate("label", x = 1.5, y = 0.95, label = prediction_label,
             size = 5, fontface = "bold", fill = pred_color, color = "white",
             label.padding = unit(0.5, "lines"), label.r = unit(0.3, "lines")) +
    scale_fill_manual(values = COLORS) +
    scale_y_continuous(limits = c(0, 1.15), labels = scales::percent) +
    labs(
      title = "Classification Prediction Result",
      subtitle = paste("Sample:", SAMPLE_ID, "| Model: stackHMM methylation features"),
      x = NULL,
      y = "Prediction Probability"
    ) +
    theme_pub(base_size = 14) +
    theme(legend.position = "none")
  
  print(p_pred)
}
Figure 9: Classification prediction result
Show code
if (prediction_available) {
  pred_table <- tibble::tibble(
    Metric = c("Predicted Class", "Ctrl Probability", "ALS Probability", "Confidence"),
    Value = c(
      pred_class,
      sprintf("%.2f%%", pred_prob$Ctrl * 100),
      sprintf("%.2f%%", pred_prob$ALS * 100),
      sprintf("%.2f%%", max(pred_prob$Ctrl, pred_prob$ALS) * 100)
    )
  )
  
  knitr::kable(pred_table, caption = "Classification Prediction Results") %>%
    kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
    kableExtra::row_spec(1, bold = TRUE, color = "white", background = pred_color)
}
Classification Prediction Results
Metric Value
Predicted Class ALS
Ctrl Probability 25.80%
ALS Probability 74.20%
Confidence 74.20%

9 Summary

Show code
# Compile all key results
summary_results <- tibble::tibble(
  Category = c(
    rep("Quality Control", 5),
    rep("Fragment Analysis", 4),
    rep("End Motifs", 2),
    if (prediction_available) "Classification" else character(0)
  ),
  Metric = c(
    "Total Reads", "Mapping Rate", "Bisulfite Conversion", "CpG Methylation", "Mean MAPQ",
    "Median Fragment Length", "Mode Fragment Length", "Short/Long Ratio", "TSS Enrichment Score",
    "Top Motif", "Motif Diversity (MDS)",
    if (prediction_available) "Predicted Group" else character(0)
  ),
  Value = c(
    format(summary_stats$total_reads, big.mark = ","),
    sprintf("%.2f%%", summary_stats$mapping_rate * 100),
    sprintf("%.2f%%", summary_stats$bisulfite_conversion * 100),
    sprintf("%.2f%%", summary_stats$cpg_methylation * 100),
    sprintf("%.1f", summary_stats$mean_mapq),
    sprintf("%d bp", as.integer(summary_stats$median_frag_length)),
    sprintf("%d bp", summary_stats$frag_mode),
    sprintf("%.3f", short_long_ratio),
    sprintf("%.2f", max(tss_profile$enrichment, na.rm = TRUE)),
    as.character(top20_motifs$motif[1]),
    sprintf("%.4f", mds),
    if (prediction_available) sprintf("%s (%.1f%%)", pred_class, max(pred_prob) * 100) else character(0)
  )
)

knitr::kable(summary_results, caption = paste("Analysis Summary for", SAMPLE_ID)) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  kableExtra::pack_rows("Quality Control", 1, 5) %>%
  kableExtra::pack_rows("Fragment Analysis", 6, 9) %>%
  kableExtra::pack_rows("End Motifs", 10, 11) %>%
  {if (prediction_available) kableExtra::pack_rows(., "Classification", 12, 12) else .}
Analysis Summary for SRR13404367
Category Metric Value
Quality Control
Quality Control Total Reads 128,966
Quality Control Mapping Rate 100.00%
Quality Control Bisulfite Conversion 99.53%
Quality Control CpG Methylation 82.71%
Quality Control Mean MAPQ 38.1
Fragment Analysis
Fragment Analysis Median Fragment Length 162 bp
Fragment Analysis Mode Fragment Length 160 bp
Fragment Analysis Short/Long Ratio 0.438
Fragment Analysis TSS Enrichment Score -Inf
End Motifs
End Motifs Top Motif AAAA
End Motifs Motif Diversity (MDS) 0.9593
Classification
Classification Predicted Group ALS (74.2%)

10 Methods

10.1 Data Processing

The BAM file was processed using custom R functions that implement the following pipeline:

  1. Read Filtering: Reads were filtered to retain properly paired, non-duplicate alignments with MAPQ ≥ 30.
  2. Fragment Extraction: Fragment coordinates were derived from paired-end alignments using the union span of mate alignments.
  3. GC Content: Fragment GC content was calculated as (G+C)/(A+C+G+T) excluding N bases.
  4. Methylation Extraction: CpG methylation was extracted from Bismark XM tags or coverage files.

10.2 Fragmentation Analysis

  • Insert Size Distribution: Fragment lengths were computed from paired-end alignment coordinates.
  • Short/Long Ratio: Ratio of fragments 90-150 bp (short) to 151-220 bp (long).
  • GC Correction: LOESS regression was used to correct for GC bias in fragment counts.

10.3 End Motif Analysis

  • Motif Extraction: 4-mer sequences were extracted from both 5’ ends of each fragment (strand-aware).
  • Motif Diversity Score: Calculated as normalized Shannon entropy: MDS = H(p) / log₂(256).

10.4 Classification

  • Model: Random Forest classifier trained on ALS vs Control cohort data.
  • Features: stackHMM-based regional methylation levels (100 chromatin states).
  • Validation: Nested cross-validation (LOOCV outer, 5-fold inner) with Boruta feature selection.

10.5 Software

This analysis was performed using R version R version 4.5.1 (2025-06-13) with Bioconductor packages for genomic analysis. All core algorithms are identical to those used in the cohort analysis to ensure reproducibility.


Report generated on 2026-01-30 16:19:22.425818